Project : EXPLORE AND SUMMARISE DATA

Name : ELDHOSE PETER

This tidy data set contains 1,599 red wines with 11 variables on the chemical
properties of the wine. At least 3 wine experts rated the quality of each wine,
providing a rating between 0 (very bad) and 10 (very excellent).

Dataset: RED WINE QUALITY

Loading the data set

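# Read the red wine data; the file is expected in the working directory.
df <- read.csv('wineQualityReds.csv')
# Drop the row-id column X so only the measured variables remain.
data_new <- subset(df, select = -c(X))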

About the dataset

The dataset consists of 1,599 observations of 13 variables: the id column ‘X’,
the 11 chemical properties and ‘quality’.
Variable ‘X’ is the id given to each observation.
At least 3 wine experts rated the quality of each wine, providing a rating
between 0 (very bad) and 10 (very excellent).
Except for ‘X’ and ‘quality’, all variables are of datatype ‘numeric’.
‘X’ and ‘quality’ are of datatype ‘integer’.

Attribute information:

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
The quality score ranges from 0 (very bad) to 10 (very excellent).

Description of attributes:
1 - fixed acidity: most acids involved with wine are fixed or
nonvolatile (do not evaporate readily).
2 - volatile acidity: the amount of acetic acid in wine, which at too
high levels can lead to an unpleasant, vinegar taste.
3 - citric acid: found in small quantities, citric acid can
add ‘freshness’ and flavor to wines.
4 - residual sugar: the amount of sugar remaining after fermentation stops;
it’s rare to find wines with less than 1 gram/liter, and wines with greater
than 45 grams/liter are considered sweet.
5 - chlorides: the amount of salt in the wine.
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between
molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial
growth and the oxidation of wine.
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low
concentrations, SO2 is mostly undetectable in wine, but at free SO2
concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
8 - density: the density of wine is close to that of water, depending on
the percent alcohol and sugar content.
9 - pH: describes how acidic or basic a wine is on a scale from 0
(very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
10 - sulphates: a wine additive which can contribute to sulfur dioxide
gas (SO2) levels, which acts as an antimicrobial and antioxidant.
11 - alcohol: the percent alcohol content of the wine.

Reference : https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Structure of the dataset

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
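The listing above has the shape of the output that str() prints. The report does not show the call it used, so the following is only a minimal sketch under that assumption: it builds a five-row stand-in from the values in the listing, inspects the column types, and drops the id column as in the loading step. The name col_types is illustrative.

```r
# Five-row stand-in copied from the structure listing above; the report itself
# reads all 1599 rows from wineQualityReds.csv.
df <- data.frame(
  X             = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality       = c(5L, 5L, 5L, 6L, 5L)
)

# Compact overview: dimensions, column types and the first few values.
str(df)

# Column classes as a named character vector.
col_types <- sapply(df, class)
print(col_types)

# Drop the id column, as in the loading step.
data_new <- subset(df, select = -c(X))
```

On the full file, the same calls should report 13 columns before X is dropped and 12 afterwards.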

Univariate Plots Section

Let's plot all the variables in the data set. This shows how the various
chemical properties of the red wines are distributed.
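The plotting code is not shown in the report. A minimal sketch of one way to draw a histogram per variable follows. It assumes base R graphics and uses stack() as a stand-in for the melt function the report mentions later, on a ten-row sample copied from the structure listing.

```r
# Ten-row sample copied from the structure listing; the report uses all 1599 rows.
wine <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076,
                     0.075, 0.069, 0.065, 0.073, 0.071),
  pH             = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)

# Long format: one row per (variable, value) pair, in columns 'values' and 'ind'.
long <- stack(wine)

# One histogram panel per variable.
op <- par(mfrow = c(2, 2))
for (v in levels(long$ind)) {
  hist(long$values[long$ind == v], main = v, xlab = v)
}
par(op)
```

With only ten rows the histograms are coarse; the point is the reshape-then-plot pattern, not the shapes.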

The histograms show the distribution of each variable in the data set.
Volatile acidity, density and pH appear to be approximately normally
distributed. Some distributions, such as residual sugar and chlorides, are
long tailed. To understand the long tailed distributions better, we can use a
log 10 transformation.

Let's transform the x axis of the long tailed plots using log 10.

The log 10 transformation brings the residual sugar and chlorides plots
roughly to normal distributions.
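As a rough illustration of the transformation, the sketch below compares a histogram on the raw scale with one on the log 10 scale. It assumes base R and transforms the values directly, which has roughly the same effect as rescaling the x axis. The ten residual sugar values are copied from the structure listing.

```r
# Residual sugar (g/dm^3) for the first ten wines in the structure listing.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

op <- par(mfrow = c(1, 2))

# Raw scale: the single large value stretches the right tail.
hist(residual.sugar, main = "raw scale", xlab = "residual sugar (g/dm^3)")

# Log 10 scale: large values are pulled in towards the bulk of the data.
log_sugar <- log10(residual.sugar)
hist(log_sugar, main = "log10 scale", xlab = "log10(residual sugar)")

par(op)
```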

Univariate Analysis

Summary of the dataset

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Occurrences of each quality level in the dataset:
Quality : Count
3 : 10
4 : 53
5 : 681
6 : 638
7 : 199
8 : 18
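Counts like these can be produced with table(). A minimal sketch on the ten quality scores shown in the structure listing follows (an assumption about the method, since the report does not show its call). On the full data the equivalent call would be table(data_new$quality).

```r
# Quality scores of the first ten wines in the structure listing.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Number of wines at each quality level.
counts <- table(quality)
print(counts)

# Five-number summary plus the mean, as in the summary listing.
print(summary(quality))
```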

What is the structure of your dataset?

The dataset consists of 1,599 observations of 13 variables (“X”,
“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”,
“chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”,
“sulphates”, “alcohol”, “quality”).
Variable ‘X’ is the id given to each observation.
Quality: at least 3 wine experts rated each wine, providing a rating between
0 (very bad) and 10 (very excellent).
Except for ‘X’ and ‘quality’, all variables are of datatype ‘numeric’. ‘X’ and
‘quality’ are of datatype ‘integer’.

From the histograms, we can see that quality levels 5 and 6 occur most often
in the data set.

Volatile acidity, density and pH appear to be approximately normally
distributed. Some distributions, such as residual sugar and chlorides, are
long tailed.
The range and mean of each variable, taken from the summary above:
fixed acidity : 4.6 to 15.9 g/dm^3, mean 8.32
volatile acidity : 0.12 to 1.58 g/dm^3, mean 0.528
citric acid : 0 to 1 g/dm^3, mean 0.271
residual sugar : 0.9 to 15.5 g/dm^3, mean 2.539
chlorides : 0.012 to 0.611 g/dm^3, mean 0.087
free sulfur dioxide : 1 to 72 mg/dm^3, mean 15.87
total sulfur dioxide : 6 to 289 mg/dm^3, mean 46.47
density : 0.9901 to 1.0037 g/cm^3, mean 0.9967
pH : 2.74 to 4.01, mean 3.311
sulphates : 0.33 to 2 g/dm^3, mean 0.658
alcohol : 8.4 to 14.9 % by volume, mean 10.42
quality : 3 to 8, mean 5.636

Questions

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality of the red wine and how it varies
with the chemical properties.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Acidity is reflected in the pH level, so pH is treated as an important
supporting variable.

Did you create any new variables from existing variables in the dataset?

I did not create any new variables from the existing variables in the dataset.
I did create a reshaped copy of the data with the melt function in order to
plot the histograms of the variables.

Of the features you investigated, were there any unusual distributions?

Residual sugar and chlorides had long tailed distributions, so I scaled the x
axis using log 10. Also, quality only takes values from 3 to 8, although the
rating scale runs from 0 to 10.

Bivariate Plots Section

Let's plot the quality of the red wine against each of the chemical properties
to find which of them are related to quality. I have chosen box plots because
they represent a categorical variable such as quality well.
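A minimal sketch of box plots of a chemical property by quality level is given here, assuming base R graphics (the report does not show its plotting code). The ten rows are copied from the structure listing, so only quality levels 5, 6 and 7 appear.

```r
# Ten-row sample copied from the structure listing.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

op <- par(mfrow = c(1, 2))

# The formula interface draws one box per quality level.
bp_alcohol <- boxplot(alcohol ~ quality, data = wine,
                      xlab = "quality", ylab = "alcohol (% by volume)")
bp_acidity <- boxplot(volatile.acidity ~ quality, data = wine,
                      xlab = "quality", ylab = "volatile acidity (g/dm^3)")

par(op)

# boxplot() also returns the group labels and group sizes it used.
print(bp_alcohol$names)
print(bp_alcohol$n)
```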

The box plots provide several insights into the red wine data set. Some
variables, such as residual sugar and chlorides, have a large number of
outliers, as we saw earlier. The box plots also confirm that most of the red
wine samples in the data set have a quality of 5 or 6. Wines with higher
alcohol content tend to have better quality. Similarly, wines with lower
volatile acidity tend to have better quality.
The box plots also give the range of values, the median and the mean of each
chemical property at each quality level.

Let's analyze this further using a correlation plot. This gives a better
understanding of the data and shows which variables have stronger relationships
with the quality of red wine.

Correlation Plot

The correlation plot shows the correlations among the variables in the data
set. The stronger the correlation, the darker the color.

The variables found to have the strongest correlation with quality are:

alcohol : 0.476
volatile acidity : -0.390
sulphates : 0.251
citric acid : 0.226
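Coefficients of this kind can be read off one column of a correlation matrix. The sketch below assumes Pearson correlation (the report does not name the method) and uses a ten-row sample copied from the structure listing, so its numbers will differ from the values reported for the full data.

```r
# Ten-row sample copied from the structure listing.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wine, method = "pearson")

# Correlation of every variable with quality, largest first.
cor_with_quality <- sort(cor_matrix[, "quality"], decreasing = TRUE)
print(round(cor_with_quality, 3))
```

The first entry is always quality itself, with a correlation of 1.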

Bivariate Analysis

After investigating the data set, four variables were found to have the
strongest correlation with quality: alcohol, volatile acidity, sulphates
and citric acid.
Alcohol content has the highest correlation with red wine quality. The
graph below shows the relationship between quality and alcohol content.
We have chosen alpha = 1/4 and jitter to reduce overplotting.
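A minimal sketch of the jitter and transparency idea follows, assuming base R graphics and quality on the x axis (the report does not show its code or axis choice). The jitter amount of 0.2 and the names jittered_quality and point_col are illustrative; the ten rows are copied from the structure listing.

```r
# Ten-row sample copied from the structure listing.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

set.seed(1)  # make the random jitter reproducible

# Quality is an integer score, so points pile up on a few x positions.
# A small random offset spreads them out.
jittered_quality <- jitter(wine$quality, amount = 0.2)

# Alpha of 1/4: four overlapping points are needed to reach full opacity.
point_col <- adjustcolor("black", alpha.f = 1/4)

plot(jittered_quality, wine$alcohol, pch = 16, col = point_col,
     xlab = "quality (jittered)", ylab = "alcohol (% by volume)")
```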

The volatile acidity is negatively correlated with the quality.

From the graphs we can see an approximately linear relationship between a few
variables and the quality of the red wine. Alcohol and volatile acidity stand
out, as they show the clearest linear relationship with quality.
The graphs show which properties are associated with the quality of red wine.
Good quality red wine tends to have:
Higher alcohol content.
Lower volatile acidity.
Higher amounts of sulphates and citric acid.

We can also see good correlation between some other pairs of variables:

fixed acidity and pH : -0.683
fixed acidity and citric acid : 0.672
fixed acidity and density : 0.668
citric acid and pH : -0.541
citric acid and volatile acidity : -0.552
alcohol and density : -0.496
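One way to find such pairs is to list the upper triangle of the correlation matrix and sort it by absolute value. The sketch below assumes Pearson correlation and uses a ten-row sample copied from the structure listing (density is left out because the listing shows only five of its values), so the ranking will not match the full data. The name pair_cor is illustrative.

```r
# Ten-row sample copied from the structure listing.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 0.06, 0.00, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
)

cm <- cor(wine)

# Row and column positions of the upper triangle: each pair appears once.
idx <- which(upper.tri(cm), arr.ind = TRUE)

pair_cor <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)

# Strongest relationships first, regardless of sign.
pair_cor <- pair_cor[order(-abs(pair_cor$r)), ]
print(pair_cor, row.names = FALSE)
```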

Let's visualize some of these relationships.

We use alpha = 1/4 to reduce overplotting.

The graphs show that there is an approximately linear relationship
between these chemical properties.

Median of each chemical property, grouped by quality:

##   quality fixed.acidity volatile.acidity citric.acid residual.sugar
## 1       3          7.50            0.845       0.035            2.1
## 2       4          7.50            0.670       0.090            2.1
## 3       5          7.80            0.580       0.230            2.2
## 4       6          7.90            0.490       0.260            2.2
## 5       7          8.80            0.370       0.400            2.3
## 6       8          8.25            0.370       0.420            2.1
##   chlorides free.sulfur.dioxide total.sulfur.dioxide  density   pH
## 1    0.0905                 6.0                 15.0 0.997565 3.39
## 2    0.0800                11.0                 26.0 0.996500 3.37
## 3    0.0810                15.0                 47.0 0.997000 3.30
## 4    0.0780                14.0                 35.0 0.996560 3.32
## 5    0.0730                11.0                 27.0 0.995770 3.28
## 6    0.0705                 7.5                 21.5 0.994940 3.23
##   sulphates alcohol
## 1     0.545   9.925
## 2     0.560  10.000
## 3     0.580   9.700
## 4     0.640  10.500
## 5     0.740  11.500
## 6     0.740  12.150
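A table of per-group medians like the one above can be built with aggregate(). This is a minimal sketch under that assumption (the report does not show its call), on a ten-row sample copied from the structure listing, so only quality levels 5, 6 and 7 appear.

```r
# Ten-row sample copied from the structure listing.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# '. ~ quality' applies median() to every other column within each quality level.
median_by_quality <- aggregate(. ~ quality, data = wine, FUN = median)
print(median_by_quality)
```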

Multivariate Plots Section

Multivariate Analysis

We have plotted graphs using the variables alcohol, volatile acidity, density,
fixed acidity and citric acid against the quality of the red wine. In all the
graphs, it is clearly seen that wines with higher alcohol content and lower
volatile acidity tend to have better quality.
In the previous sections, we saw that citric acid and volatile acidity are
negatively correlated. This is reflected in the graphs: the higher the citric
acid and the lower the volatile acidity, the better the red wine quality. The
graphs also show that wines with lower density and higher alcohol tend to have
better quality.
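A minimal sketch of one such multivariate plot follows: two chemical properties on the axes, with points and per-quality regression lines shaded from light (low quality) to dark (high quality). It assumes base R graphics and a ten-row sample copied from the structure listing. The names shades and fits are illustrative, and the report's own plotting code is not shown.

```r
# Ten-row sample copied from the structure listing.
wine <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# One grey shade per quality level: light for low quality, dark for high.
q_levels <- sort(unique(wine$quality))
shades <- gray.colors(length(q_levels), start = 0.8, end = 0.2)
names(shades) <- q_levels

plot(wine$volatile.acidity, wine$alcohol, pch = 16,
     col = shades[as.character(wine$quality)],
     xlab = "volatile acidity (g/dm^3)", ylab = "alcohol (% by volume)")

# Fit and draw a separate regression line for each quality level that has
# at least two wines (a line cannot be fitted to a single point).
fits <- list()
for (q in q_levels) {
  grp <- wine[wine$quality == q, ]
  if (nrow(grp) >= 2) {
    key <- as.character(q)
    fits[[key]] <- lm(alcohol ~ volatile.acidity, data = grp)
    abline(fits[[key]], col = shades[key], lwd = 2)
  }
}

legend("topright", legend = names(shades), col = shades, pch = 16,
       title = "quality")
```

In this small sample quality 6 has a single wine, so it gets a point but no line.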

Final Plots and Summary

Plot One

The plot represents variation of quality with alcohol and volatile acidity.

Description One

The graph shows that wines with lower volatile acidity and higher alcohol
content tend to have better quality. Volatile acidity is negatively correlated
with quality. The lighter regression lines represent low quality wines, while
the darker lines represent high quality wines. Of all the variables, red wine
quality is most strongly correlated with alcohol, volatile acidity, sulphates
and citric acid:
alcohol : 0.476
volatile acidity : -0.390
sulphates : 0.251
citric acid : 0.226

Plot Two

The plot represents variation of quality with volatile acidity and citric acid.

Description Two

The graph shows that volatile acidity and citric acid are negatively
correlated. Dark regression lines represent high quality wines, while light
regression lines represent low quality wines. Wines with a low concentration of
volatile acid and a high concentration of citric acid tend to have better
quality.

Plot Three

The plot represents variation of quality with sulphates and citric acid.

Description Three

The graph shows that higher citric acid and sulphates are associated with high
quality wine. The dark regression lines represent high quality wines, while
light regression lines represent low quality wines.

Reflection

From the univariate plots, I learned how the various chemical properties in
the data set are distributed.

Some of the distributions were long tailed, in particular residual sugar and
chlorides, so logarithmic transformations were used to reduce the effect of
outliers. Some of the graphs were overplotted, so I adjusted the alpha level
and figure size.
In the bivariate plots, I used box plots and scatter plots, which gave me a
good insight into the chemical properties most strongly related to red wine
quality.
The multivariate plots showed which pairs of chemical properties tend to go
with better quality of red wine.

One thing I noted was that the number of wine samples with quality 5 and 6 was
much higher than for the other quality levels. The study could be improved
with a greater number of samples.
Another thing I noted was that many samples of red wine contained no citric
acid at all.

We found correlations between many variables in the dataset.
The quality of red wine was most strongly correlated with alcohol, volatile
acidity, citric acid and sulphates. Wines with higher alcohol content tended
to have better quality. Volatile acidity was negatively related to wine
quality. Lower volatile acidity and higher citric acid content went together
with better quality of wine.
Higher sulphate content was also associated with high quality wine. To
summarise, wines with higher alcohol, citric acid and sulphate content and
lower volatile acidity tend to have better quality.
We also saw good correlation between some other pairs of variables, such as
fixed acidity and pH, fixed acidity and citric acid, fixed acidity and density,
and citric acid and volatile acidity.

Even though there is a noticeable correlation between quality and other
variables, correlation does not mean causation. We cannot say that higher
alcohol content gives better quality of wine. We can only conclude that red
wines of higher quality tend to have higher alcohol content.

The analysis could be improved in future by increasing the number of red wine
samples. In this data set, wines of quality 5 and 6 occur far more often than
wines of other qualities, and there are very few samples of quality 3, 4 and 8.
More data should be collected for very low quality and very high quality wines.